feat(models): add WebLLM model provider for on-device browser inference #1036
jsamuel1 wants to merge 1 commit into
Conversation
Adds a new WebLLMModel provider under @strands-agents/sdk/models/webllm that runs quantized LLMs entirely in the browser via WebGPU using @mlc-ai/web-llm. Models are cached in browser storage after the first download. Includes cache management helpers (downloadWebLLMModel, isWebLLMModelCached, deleteWebLLMModel, listWebLLMModels) so apps can pre-download models from a settings UI and report progress via an onProgress callback. @mlc-ai/web-llm is added as an optional peer dependency to keep it out of the default dependency graph for server-side users. Resolves #1035
     *
     * @throws {@link WebLLMUnavailableError} when WebLLM cannot be loaded.
     */
    export async function listWebLLMModels(appConfig?: AppConfig): Promise<WebLLMModelInfo[]> {
Issue: listWebLLMModels does not call assertBrowserEnvironment() unlike isWebLLMModelCached, deleteWebLLMModel, and downloadWebLLMModel. This is inconsistent — if the module can't be loaded in Node, it will throw WebLLMUnavailableError from loadWebLLMModule() anyway, but the error message won't be the clear "requires a browser" guidance.
Suggestion: Either add assertBrowserEnvironment() for consistency with the other helpers, or add a code comment explaining why listWebLLMModels intentionally skips the check (e.g., if it's designed to work in server-side contexts for listing available models without needing WebGPU).
      }

      return events
    }
Issue: The mapChunkToEvents, extractUsage, and the streaming state management logic here are nearly line-for-line identical to mapChatChunkToEvents in src/models/openai/chat-adapter.ts. This creates a maintenance burden where fixes to one must be duplicated to the other.
Suggestion: Consider extracting the shared OpenAI-compatible chunk-to-event mapping into a shared utility (e.g. src/models/openai-compatible-streaming.ts) that both the OpenAI chat adapter and WebLLM can import. At minimum, leave a // NOTE: comment cross-referencing the OpenAI adapter so future maintainers know to keep them in sync.
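A minimal sketch of what such a shared utility could look like. The module path is the one proposed above; the event name modelContentBlockDeltaEvent, the field shapes, and the function names createStreamState and mapOpenAICompatibleChunk are assumptions for illustration, not the SDK's actual API. Tool-call and usage handling are left out.

```typescript
// Hypothetical shared module (e.g. src/models/openai-compatible-streaming.ts).
// Event names follow this review; 'modelContentBlockDeltaEvent' and all field
// shapes are assumptions made for illustration.
type StreamEvent =
  | { type: 'modelMessageStartEvent'; role: 'user' | 'assistant' }
  | { type: 'modelContentBlockStartEvent' }
  | { type: 'modelContentBlockDeltaEvent'; text: string }

interface ChatChunk {
  choices: Array<{ delta?: { role?: string; content?: string | null } }>
}

interface StreamState {
  messageStarted: boolean
  textContentBlockStarted: boolean
}

function createStreamState(): StreamState {
  return { messageStarted: false, textContentBlockStarted: false }
}

// Maps one OpenAI-style chunk to zero or more events, mutating `state`.
function mapOpenAICompatibleChunk(chunk: ChatChunk, state: StreamState): StreamEvent[] {
  const events: StreamEvent[] = []
  const delta = chunk.choices[0]?.delta

  if (delta?.role && !state.messageStarted) {
    state.messageStarted = true
    events.push({ type: 'modelMessageStartEvent', role: delta.role === 'user' ? 'user' : 'assistant' })
  }

  if (delta?.content && delta.content.length > 0) {
    if (!state.textContentBlockStarted) {
      state.textContentBlockStarted = true
      events.push({ type: 'modelContentBlockStartEvent' })
    }
    events.push({ type: 'modelContentBlockDeltaEvent', text: delta.content })
  }

  return events
}

// Each adapter keeps one state object per stream and feeds chunks in order.
const demoState = createStreamState()
const demoChunks: ChatChunk[] = [
  { choices: [{ delta: { role: 'assistant' } }] },
  { choices: [{ delta: { content: 'Hel' } }] },
  { choices: [{ delta: { content: 'lo' } }] },
]
const demoEventTypes: string[] = []
for (const chunk of demoChunks) {
  for (const event of mapOpenAICompatibleChunk(chunk, demoState)) demoEventTypes.push(event.type)
}
```

Keeping the state in a plain object passed by the caller means the OpenAI chat adapter and the WebLLM provider could import the same function without sharing any class hierarchy.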
      modelLib: record.model_lib,
    }
    if (record.vram_required_MB !== undefined) info.vramMB = record.vram_required_MB
    if (typeof (record as unknown as { model_type?: string }).model_type === 'string') {
Issue: The model_type access uses a double cast through unknown ((record as unknown as { model_type?: string }).model_type), which is fragile and circumvents type safety.
Suggestion: Since ModelRecord comes from @mlc-ai/web-llm types, either:
- Use optional chaining with an in check: if ('model_type' in record && typeof record.model_type === 'string')
- Or extend the ModelRecord type locally if this field is expected but not yet typed upstream
The current double-cast could silently break if model_type is renamed or restructured.
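A short sketch of the first option. ModelRecordLike and ModelInfoSketch are stand-ins declared here so the example is self-contained; they are not the real @mlc-ai/web-llm or SDK types, and the model_id field name is an assumption. The in narrowing on an undeclared property relies on TypeScript 4.9 or later.

```typescript
// Stand-in for the upstream record type; only the fields used here are declared.
interface ModelRecordLike {
  model_id: string
  model_lib: string
  vram_required_MB?: number
}

// Stand-in for the provider's info type (field names taken from the diff context).
interface ModelInfoSketch {
  modelId: string
  modelLib: string
  vramMB?: number
  modelType?: string
}

function toModelInfo(record: ModelRecordLike): ModelInfoSketch {
  const info: ModelInfoSketch = { modelId: record.model_id, modelLib: record.model_lib }
  if (record.vram_required_MB !== undefined) info.vramMB = record.vram_required_MB
  // `in` narrows `record` so that `model_type` is readable as `unknown`,
  // and `typeof` then narrows it to `string`. No cast is needed.
  if ('model_type' in record && typeof record.model_type === 'string') {
    info.modelType = record.model_type
  }
  return info
}

// A record carrying the extra field, held in a variable so the extra key is allowed.
const typedRecord = { model_id: 'example-model', model_lib: 'example.wasm', model_type: 'llm' }
const typedInfo = toModelInfo(typedRecord)
```

If the field is later renamed upstream, this version simply stops populating modelType and the compiler still checks every other access on record.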
      if (bufferedUsage) yield bufferedUsage
      if (bufferedStop) yield bufferedStop
    } catch (error) {
Issue: The stream method catches errors from the engine and re-throws via normalizeError(error), but if the error occurs during iteration of the async iterable (inside for await), the generator will be in a partially-yielded state. The consumer will see the error, but any buffered modelContentBlockStartEvent won't have a matching modelContentBlockStopEvent, which could leave the SDK's message accumulator in an inconsistent state.
Suggestion: Consider emitting content block stop events in the catch/finally block when state.textContentBlockStarted is true or activeToolCalls is non-empty, to ensure the stream is always well-formed even on errors.
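One way to keep the stream well-formed, sketched with a simplified text-only generator. The event shapes and the streamText name are assumptions; the real stream method also tracks tool calls and passes the error through normalizeError. The stop event is emitted from catch rather than finally so that a consumer that stops iterating early does not trigger an extra yield.

```typescript
// Simplified event shapes, assumed for illustration only.
type BlockEvent =
  | { type: 'modelContentBlockStartEvent' }
  | { type: 'modelContentBlockDeltaEvent'; text: string }
  | { type: 'modelContentBlockStopEvent' }

async function* streamText(chunks: AsyncIterable<string>): AsyncGenerator<BlockEvent> {
  let blockOpen = false
  try {
    for await (const text of chunks) {
      if (!blockOpen) {
        blockOpen = true
        yield { type: 'modelContentBlockStartEvent' }
      }
      yield { type: 'modelContentBlockDeltaEvent', text }
    }
    if (blockOpen) {
      blockOpen = false
      yield { type: 'modelContentBlockStopEvent' }
    }
  } catch (error) {
    // Close any open block first so the consumer never sees an unmatched start,
    // then surface the original error on the next pull.
    if (blockOpen) {
      blockOpen = false
      yield { type: 'modelContentBlockStopEvent' }
    }
    throw error
  }
}

// A source that fails mid-stream, standing in for an engine error.
async function* failingSource(): AsyncGenerator<string> {
  yield 'partial'
  throw new Error('engine failed')
}
```

A consumer iterating streamText(failingSource()) receives start, delta, stop and only then the error, so an accumulator can finalize the partial block before handling the failure.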
    "@aws-sdk/client-s3": "^3.943.0",
    "@google/genai": "^1.40.0",
    "@modelcontextprotocol/sdk": "^1.25.2",
    "@mlc-ai/web-llm": "^0.2.79",
Issue: The peer dependency is specified as "^0.2.79" which for a pre-1.0 package (semver treats 0.x specially) only allows 0.2.x patches. This is correctly conservative. However, @mlc-ai/web-llm has a history of frequent breaking changes within minor versions (their API changed between 0.2.x releases).
Suggestion: Consider whether pinning to an exact 0.2.79 would be safer (for a 0.x version, ~0.2.79 resolves to the same >=0.2.79 <0.3.0 range as ^0.2.79, so a tilde alone does not tighten anything), or alternatively document in the module TSDoc which web-llm API surface you depend on. If the intent is to support a range, add a comment in package.json or the README noting the tested/verified version range.
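To make the range arithmetic concrete, here is a small hand-rolled sketch of the exclusive upper bounds that the caret and tilde operators imply under npm's semver rules. The helper names are made up for this example, pre-release tags are ignored, and a real project would use the semver package instead.

```typescript
function parseVersion(version: string): [number, number, number] {
  const [major = 0, minor = 0, patch = 0] = version.split('.').map(Number)
  return [major, minor, patch]
}

// Exclusive upper bound of ^version: the leftmost non-zero component may not change.
function caretUpperBound(version: string): string {
  const [major, minor, patch] = parseVersion(version)
  if (major > 0) return `${major + 1}.0.0`
  if (minor > 0) return `0.${minor + 1}.0`
  return `0.0.${patch + 1}`
}

// Exclusive upper bound of ~version (with a patch given): only the patch may change.
function tildeUpperBound(version: string): string {
  const [major, minor] = parseVersion(version)
  return `${major}.${minor + 1}.0`
}

// For 0.2.79 both operators stop at the same place.
const caretBound = caretUpperBound('0.2.79') // '0.3.0'
const tildeBound = tildeUpperBound('0.2.79') // '0.3.0'
// For a 1.x package they differ.
const sdkCaretBound = caretUpperBound('1.25.2') // '2.0.0'
const sdkTildeBound = tildeUpperBound('1.25.2') // '1.26.0'
```

So for this peer dependency the only way to exclude later 0.2.x releases is an exact version or an explicit upper bound.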
      events.push({ type: 'modelMessageStartEvent', role: delta.role as 'user' | 'assistant' })
    }

    if (delta?.content && delta.content.length > 0) {
Issue: If the stream starts emitting content deltas without a preceding role delta (e.g. some engines skip the role chunk), no modelMessageStartEvent is ever emitted, but content block events are still produced. This would leave the SDK's stream consumer in an inconsistent state.
Suggestion: Add a guard that emits a synthetic modelMessageStartEvent with role: 'assistant' when content arrives before a role delta, similar to how the text content block start is auto-emitted:
    if (delta?.content && delta.content.length > 0) {
      if (!state.messageStarted) {
        state.messageStarted = true
        events.push({ type: 'modelMessageStartEvent', role: 'assistant' })
      }
      // ...
    }

      return this._enginePromise
    }

    private async _createEngine(): Promise<MLCEngineInterface> {
Issue: The _createEngine method calls assertBrowserEnvironment() synchronously, then loadWebLLMModule() which also surfaces an environment error. However, when _getEngine() is called, it caches the promise — if the first call fails (e.g., module not found), it correctly resets _enginePromise allowing retry. But assertBrowserEnvironment() will always throw synchronously in Node, meaning the retry logic is unreachable in that scenario. This is fine but worth noting that the catch reset on line 300 only helps for transient loadWebLLMModule failures, not environment failures.
No action required — just noting for clarity that the retry semantics only apply to module loading/engine init failures in a valid browser environment.
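For readers unfamiliar with the pattern, the retry semantics described here can be sketched as a cached promise that clears itself on rejection. LazyEngine and its factory parameter are illustrative names, not the provider's actual members.

```typescript
class LazyEngine<T> {
  private _enginePromise: Promise<T> | undefined
  private readonly _create: () => Promise<T>

  constructor(create: () => Promise<T>) {
    this._create = create
  }

  // Concurrent callers share one in-flight promise; a rejection clears the cache
  // so that the next call starts a fresh attempt.
  get(): Promise<T> {
    if (this._enginePromise === undefined) {
      this._enginePromise = this._create().catch((error: unknown) => {
        this._enginePromise = undefined
        throw error
      })
    }
    return this._enginePromise
  }
}
```

With a factory that fails once and then succeeds, the first get() rejects, the second get() runs the factory again, and later calls reuse the resolved engine. A factory that always rejects for the same reason, as an environment check does outside a browser, gains nothing from the reset.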
     *
     * @internal
     */
    export function assertBrowserEnvironment(): void {
Issue: assertBrowserEnvironment() checks typeof window === 'undefined' to detect non-browser environments. However, some server-side runtimes (Cloudflare Workers, Deno Deploy) and test environments (jsdom) define window without actually having WebGPU. Conversely, Web Workers (where WebGPU is available) don't have window.
Suggestion: Consider checking for typeof navigator !== 'undefined' && 'gpu' in navigator (or at minimum typeof globalThis.navigator !== 'undefined') as a more accurate browser+WebGPU heuristic, or simply let the CreateMLCEngine call surface WebLLM's own environment check (which it already does) and remove the preemptive check. The error message could also mention Web Workers as a valid environment.
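A sketch of the navigator-based heuristic. The function name hasWebGPU is hypothetical, and the global scope is passed in as a parameter so the check works the same way on window, on a Web Worker's self, and in tests. Note that the presence of navigator.gpu is only a hint: requesting an adapter can still fail on unsupported hardware.

```typescript
// Returns true when the given global scope exposes `navigator.gpu`.
// Defaults to `globalThis`, which is `window` on a page and `self` in a worker.
function hasWebGPU(scope: unknown = globalThis): boolean {
  if (typeof scope !== 'object' || scope === null) return false
  const nav = (scope as { navigator?: unknown }).navigator
  return typeof nav === 'object' && nav !== null && 'gpu' in nav
}

// jsdom-like scope: `window` exists, but there is no WebGPU.
const jsdomLike = { window: {}, navigator: { userAgent: 'jsdom' } }
// Worker-like scope: no `window`, but `navigator.gpu` is present.
const workerLike = { navigator: { gpu: {} } }

const jsdomResult = hasWebGPU(jsdomLike) // false
const workerResult = hasWebGPU(workerLike) // true
```

Compared with typeof window === 'undefined', this rejects the jsdom-like case and accepts the worker-like one, which matches the two mismatches described above.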
Review Summary
Assessment: Comment (Draft PR - not blocking, providing feedback for iteration)
This is a well-structured addition that follows existing model provider patterns closely. The code is clean, well-documented, and thoroughly tested.
Review Categories
Good work on the overall design: the cache helper separation, abort signal support, and consistent error class hierarchy are thoughtful touches.

This repository has been merged into the strands-agents/harness-sdk monorepo and will be archived shortly. All new development happens there. If this PR is still relevant, please recreate it against the monorepo. The code now lives under Apologies for the disruption, and thank you for contributing!
Motivation
WebLLM runs quantized LLMs entirely in the browser via WebGPU, with model weights cached in IndexedDB/CacheStorage after the first download. Without a first-class provider, users building browser-based agents have to wire up @mlc-ai/web-llm themselves or reach for the community webllm-ai-provider via VercelModel (0.0.1, ~2 weekly downloads on npm).
This adds a WebLLMModel provider under @strands-agents/sdk/models/webllm so on-device, offline-capable agents are a one-import experience, matching how BedrockModel, AnthropicModel, etc. are shipped today.
Resolves strands-agents/harness-sdk#2481
Public API Changes
New subpath export @strands-agents/sdk/models/webllm with a WebLLMModel class and cache-management helpers.
Cache helpers let apps pre-download from a settings UI, check what's cached, and evict models independently of an agent invocation:
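A minimal sketch of that flow, under stated assumptions: the helper names come from the PR description, but the parameter lists, the progress value being a 0 to 1 fraction, and the ensureModelCached wrapper are guesses for illustration. An in-memory fake stands in for the real helpers so the example runs without a browser or the peer dependency.

```typescript
// Assumed helper shapes. The function names come from the PR description;
// the parameter lists and return types below are guesses for illustration.
interface CacheHelpers {
  isWebLLMModelCached(modelId: string): Promise<boolean>
  downloadWebLLMModel(modelId: string, options: { onProgress: (fraction: number) => void }): Promise<void>
  deleteWebLLMModel(modelId: string): Promise<void>
}

// Downloads the model only when it is not cached yet, reporting progress to the UI.
async function ensureModelCached(
  helpers: CacheHelpers,
  modelId: string,
  onProgress: (fraction: number) => void
): Promise<'already-cached' | 'downloaded'> {
  if (await helpers.isWebLLMModelCached(modelId)) return 'already-cached'
  await helpers.downloadWebLLMModel(modelId, { onProgress })
  return 'downloaded'
}

// In-memory stand-in for the real helpers (hypothetical, for this sketch only).
function createFakeHelpers(): CacheHelpers {
  const cached = new Set<string>()
  return {
    isWebLLMModelCached: async (modelId) => cached.has(modelId),
    downloadWebLLMModel: async (modelId, options) => {
      options.onProgress(0.5)
      cached.add(modelId)
      options.onProgress(1)
    },
    deleteWebLLMModel: async (modelId) => {
      cached.delete(modelId)
    },
  }
}
```

In an application the helpers would be imported from @strands-agents/sdk/models/webllm instead of the fake, and onProgress would typically update a progress bar in the settings UI.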
@mlc-ai/web-llm is declared as an optional peerDependency, so server-side users are unaffected. Attempting to use the provider outside a browser or without the peer installed raises a typed WebLLMUnavailableError.
Use Cases
Testing
- strands-ts/src/models/webllm/__tests__/model.test.ts: unit tests for streaming/formatting/tool-use paths with a mocked MLCEngine
- strands-ts/src/models/webllm/__tests__/cache.test.node.ts: Node-side environment guards and error surfaces
- strands-ts/src/models/webllm/__tests__/browser.test.browser.ts: browser smoke test
- strands-ts/test/packages/{esm-module,cjs-module}: subpath export resolution for the new ./models/webllm entry
All existing suites pass (2554 passed) alongside the new coverage.
Notes
webllm/module.